JAMIA Open

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match JAMIA Open's content profile, based on 37 papers previously published here. The average preprint has a 0.06% match score for this journal, so anything above that is already an above-average fit.

1
Stigmatizing Language Detection in Opioid Use Disorder Patient-Directed Discharge Clinical Documentation: A Privacy-Preserving Analysis Using a Locally Deployed Large Language Model

Izzo, J. A.; McIntyre, A. M.; Nguyen, J.; Bashaw, D.; Torrance, C. A.; Foster, J.

2026-06-01 health informatics 10.64898/2026.05.29.26354402 medRxiv
Top 0.1%
14.1%

Objective: Stigmatizing language in the electronic health record (EHR) has been associated with adverse patient experience in substance use disorder care, including opioid use disorder (OUD). This study evaluated a privacy-preserving, locally deployed large language model as a method to detect stigmatizing language in the documentation of OUD patients with patient-directed discharge (PDD). Methods: In a retrospective cohort study, 477 inpatient admissions from the MIMIC-IV database with a diagnosis of opioid use disorder were classified using a locally deployed Gemma-4-31b-it-bf16 model and a predefined 140-term lexicon to identify stigmatizing language in clinical documentation. Results: Stigmatizing language was present in 84.1% (190/226) of the PDD cohort vs 62.2% (156/251) of the non-PDD cohort, with an unadjusted odds ratio of 3.21 (95% CI 2.07-4.98; p < 0.0001). After adjustment for age, sex, insurance status, marital status, and race, PDD remained an independent predictor of stigmatizing documentation (aOR 2.24, 95% CI 1.40-3.59; p < 0.0001). Further analysis of stigma intensity showed more stigmatizing markers in the PDD cohort than in the non-PDD cohort (2.85 ± 2.39 vs 2.02 ± 2.44; p < 0.0001). Discussion and Conclusion: Stigmatizing language is detected with greater frequency and prevalence in clinical documentation of OUD patients who initiate PDD than in that of patients who follow standard discharge processes. A locally deployed large language model (LLM) offers a scalable, privacy-preserving method to audit clinical documentation for stigmatizing language.
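The unadjusted odds ratio quoted in this abstract can be checked from the reported counts alone. The sketch below is a minimal illustration, not the authors' analysis code: it assumes a Wald (Woolf) interval on the log odds ratio, a method the abstract does not name, and the function name `odds_ratio_wald` is hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    The Wald interval is an assumption; the abstract does not state its method.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from the abstract: 190/226 (PDD) and 156/251 (non-PDD)
pdd_stigma, pdd_total = 190, 226
non_stigma, non_total = 156, 251

or_, lo, hi = odds_ratio_wald(
    pdd_stigma, pdd_total - pdd_stigma,
    non_stigma, non_total - non_stigma,
)
print(f"OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Under that assumption the result rounds to the reported 3.21 (2.07-4.98). The adjusted odds ratio cannot be reproduced this way, since it depends on patient-level covariates.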

2
Evaluating Large Language Models for Translating Multimodal Phenotype Documentations into Executable EHR Phenotyping Algorithms

Yan, C.; Xin, Y.; Su, W.-C.; Gangireddy, S.; Durbhakula, S.; Bruehl, S. P.; Dickson, A. L.; Li, L.; Feng, Q.; Malin, B. A.; Derr, T.; Wei, W.-Q.

2026-05-22 health informatics 10.64898/2026.05.20.26353690 medRxiv
Top 0.1%
10.4%

Research applications of electronic health record (EHR) phenotypes require translating clinical definitions into executable EHR database queries, a labor-intensive process. We evaluated two frontier large language models across five phenotypes and three documentation modalities. Both models captured high-level logic from structured text but degraded markedly with diagram-only input. Error analysis revealed seven failure categories. Documentation, rather than model capability, was the primary bottleneck, reinforcing the need for standardization and expert oversight.

3
Machine Learning Estimation of Gestational Age at Delivery Using Linked Mother-Infant Electronic Health Records Across Two Health Systems

Bejan, C. A.; Yang, X.; Pham, A.; Qassem, L.; Abraham, A. A.; Choi, L.; Rosenbloom, S. T.; Gamire, L. X.; Phillips, E. J.

2026-05-25 obstetrics and gynecology 10.64898/2026.05.23.26353959 medRxiv
Top 0.1%
10.1%

Objective: This study aimed to train and evaluate supervised machine learning algorithms using electronic health record (EHR) data to accurately estimate gestational age at delivery. Materials and Methods: We trained random forest, gradient boosting, and ensemble models on EHR data of mother-infant dyads from Vanderbilt University Medical Center (VUMC) and replicated the analyses at University of Michigan (UMich). We further analyzed EHR predictors of gestational age, assessed temporal drift in EHR data elements, and evaluated model performance stratified by delivery status. Results: The study included pregnancies corresponding to 54,344 and 34,345 mother-infant dyads at VUMC (2005-2025) and UMich (2012-2024), respectively. The gestational age predictions of the ensemble models achieved the highest agreement with the reference standard on the VUMC dataset (±1 week: 85.2%, ±2 weeks: 94.3%, MAE: 4.4 days) and demonstrated stronger generalization on the UMich dataset (±1 week: 93.1%, ±2 weeks: 97.8%, MAE: 2.8 days). Further, performance was better among pregnancies delivered in more recent years, and among full- and late-term deliveries compared with preterm deliveries. Discussion: The results indicate that supervised machine learning methods leveraging linked mother-infant EHRs can accurately estimate gestational age at delivery, while demonstrating the generalizability of the modeling approach and the portability of the analytic workflow across healthcare sites. Conclusion: This study presents a robust and generalizable machine learning framework to estimate gestational age at delivery. The framework can be reliably used to impute gestational age in large-scale, real-world clinical studies to support maternal and neonatal health research, in which accurate estimation of pregnancy onset is critical.

4
Professionalism Pulse: Development and Validation of a Natural Language Processing Pipeline and Dashboard for Safety Culture Surveillance in NYC Health + Hospitals

Mangut, E.; Wallace, R.

2026-05-22 health informatics 10.64898/2026.05.19.26353620 medRxiv
Top 0.1%
9.8%

Background: Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the "professionalism climate" within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States. Methods: A high-fidelity synthetic dataset (N=400) was computationally generated to safely mirror historical incident logs across 11 acute facilities without utilizing Protected Health Information (PHI). A rule-based NLP pipeline was developed in R utilizing the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline's accuracy, a 25% random stratified sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen's Kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance. Results: The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (k=0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. 
Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for over 50.7% of overnight feedback submissions. Conclusion: Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.

5
Combining centralized and decentralized approaches to assess and ensure data quality in Eurocrine® via Microsoft Power BI and DataquieR

Musholt, T. J.; Clerici, T.; Bergenfelz, A.; Schmidt, C. O.; Struckmann, S.

2026-06-05 health informatics 10.64898/2026.06.04.26354884 medRxiv
Top 0.1%
9.8%

Background: Medical registries have gained importance in the evaluation of healthcare quality outcomes. In the absence of high-quality evidence, such as randomized controlled trials, studies based on registry data are essential for informing clinical guidelines. Methods for assessing data quality are rarely described in detail. To ensure the credibility of registry-based studies, registries must use all available technical and operational means to guarantee high data quality. Method: Eurocrine® is a pan-European endocrine surgical database and quality registry, initially funded by the EU healthcare programme, that started in 2015 and included more than 200,000 interventions as of April 2025. To ensure high data quality, interactive and standardized reports are created both centrally and locally via Microsoft Power BI. In addition, comprehensive data quality analyses were performed with the R package dataquieR. Results: Although a multitude of technical measures (for example, input screen design and real-time plausibility checks during data entry) are in place, they are not sufficient to prevent human errors at data entry. Errors identified in the reports were corrected, and preventive measures were implemented. Overall, the data quality was assessed as very good in terms of completeness, accuracy, and consistency. Conclusion: It is very important to provide registry users with an efficient and smart tool to identify data issues, as they have the clinical information to correct them. Data quality reports generated with dataquieR represent an effective tool for registry administrators. Predesigned Microsoft Power BI reports enable participating Eurocrine® clinics to self-audit their data.

6
Impact of pharmacist board certification on health outcomes of critically ill patients: An analysis of the Optimizing Pharmacist-Team Integration for ICU patient Management (OPTIM) study

Smith, S. E.; Henry, K.; Heavner, M.; Keedy, C.; Duong, H.; Chen, Z.; Chen, X.; OPTIM Investigator Team; Sikora, A.

2026-06-02 intensive care and critical care medicine 10.64898/2026.05.26.26353672 medRxiv
Top 0.1%
8.7%

BACKGROUND: Critical care pharmacists (CCPs) reduce adverse drug events (ADEs) and mortality in the intensive care unit (ICU). Board certification is the established professional standard for CCPs, but its impact on ICU patient outcomes, including its relationship with CCP characteristics and workload, remains unclear. The purpose of this study was to evaluate the association between pharmacist board certification, CCP workload characteristics, and patient outcomes. METHODS: This was a pre-planned analysis of the multicenter, observational Optimizing Pharmacist-Team Integration for ICU Patient Management (OPTIM) study, including adult ICU patients cared for by CCPs. Patients cared for exclusively by board-certified pharmacists on every ICU day were categorized as the BCP group; those with at least one day of care from a non-board-certified pharmacist comprised the non-BCP group. The primary outcome was hospital mortality; secondary outcomes included the hazard of discharge alive (HDA) from the ICU and hospital. Multivariable logistic regression was used to evaluate the association between BCP and mortality; Fine-Gray competing risk models were used to assess the relationship between BCP and ICU and hospital HDA. RESULTS: A total of 201 pharmacists (184 BCPs; 17 non-BCPs) from 63 institutions caring for 20,537 ICU patients were included. Care provided exclusively by a BCP (vs. ≥1 day by a non-BCP) was associated with lower mortality (OR 0.80, 95% CI 0.69 to 0.92, p=0.002) and both a higher ICU HDA (HR 1.08, 95% CI 1.03 to 1.13, p<0.001) and hospital HDA (HR 1.19, 95% CI 1.13 to 1.26, p<0.001). CONCLUSION: Daily ICU care delivered by pharmacists with board certification was independently associated with reduced mortality and improved hazard of discharge alive from the ICU. Board-certified pharmacists may enhance the quality and/or efficiency of critical care pharmacy services. 
These findings support the role of board certification as a modifiable factor to improve patient outcomes and optimize workload in the ICU.

7
PheBee: A Graph-Aware System for Scalable, Traceable, and Semantic Phenotyping

Gordon, D. M.; Homilius, M.; Antoniou, A. A.; Grannis, C.; Lammi, G. E.; Herman, A. C.; Kubatko, A.; Chaudhari, B. P.; White, P.

2026-05-13 health informatics 10.64898/2026.05.09.26352812 medRxiv
Top 0.1%
8.6%

Objectives: Phenotype-driven workflows in clinical and translational research require standardized ontology-based representation, ontology-aware cohort discovery, and provenance inspection for each assertion. Existing approaches optimize either for semantic traversal or scalable batch analytics, but not both. We describe PheBee, a hybrid system that links semantic assertions to scalable evidence storage via a deterministic identifier, preserving provenance while supporting ontology-aware discovery at cohort scale. Materials and Methods: PheBee represents phenotype assertions in a knowledge graph as ontology-linked nodes with clinical modifier context (e.g., negated, family history), and stores supporting evidence records in a scalable row-oriented evidence table for cohort-scale access. The two layers are connected by a deterministic identifier enabling stable joins across repeated ingestions without duplicating high-volume evidence in the graph. We evaluated PheBee using synthetic datasets designed to exercise end-to-end ingestion and query workflows. Results: Functional evaluation validated hierarchical term expansion, qualifier-aware retrieval, duplicate-free assertion handling under re-ingestion, and privacy-conscious management of subjects shared across multiple research projects. At scale (10,000 subjects producing 12M evidence records) PheBee completed ingestion in ~30 minutes and responded to interactive queries within 6 seconds under concurrent load. Discussion: PheBee exposes a unified API for ontology-aware cohort discovery with hierarchical term expansion, subject-centric retrieval of phenotypes and clinical modifiers, and evidence and provenance queries. Its data model aligns with GA4GH Phenopackets, facilitating interoperability with phenotype exchange standards. 
Conclusion: By combining ontology-aware semantics with scalable, provenance-bearing evidence storage, PheBee provides a practical open-source foundation for phenotype-driven research workflows that demand both semantic precision and cohort-scale traceability. Lay Summary: Researchers often use "phenotypes" (observable clinical features) to describe individual subjects and find groups of similar subjects. Those phenotypes come from many sources and need both standard terminology and clear evidence for why a phenotype has been associated with a subject. PheBee is a software system that stores phenotype assertions in a way that supports both "ontology-aware" searching (for example, finding patients with any subtype of a condition) and scalable storage of supporting evidence across large research cohorts. PheBee uses multiple types of data storage so researchers can perform interactive phenotype searches and also store millions of pieces of supporting evidence. A shared identifier connects the two storage layers, so subjects' phenotypes and their supporting evidence remain linked even as new data is added over time. We evaluated PheBee using fully synthetic (non-patient) data to confirm correct query behavior, evidence traceability, and system performance at large scale.

8
Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark

Chuang, K.-C.; Lin, H.-J.; Lin, H.-M.

2026-05-26 health informatics 10.64898/2026.05.23.26353939 medRxiv
Top 0.1%
8.4%

Background: Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April-May 2026) for open-ended medication review. Methods: Fifty synthetic CKD cases across three complexity groups (G3a-G3b [n=20], G4 [n=15], G5/G5D/transplant [n=15]) with 8-12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). Primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample. Results: Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (kappa = 0.934, n = 92). Conclusions: This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. 
Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.

9
A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety

Proulx, J.; Daines, B.; Barton, M.; Leonard, M. E.; Garcia, J. A.; Young, B.; Snell, Q.; West, T. W.; Watson, S. R.; AlQaseer, M.; Louiset, M.; Maqsood, M. B.; Voutt-Goos, M. J.; Douma, C.; Kasbekar, N.; Jeffries, J.; Abu-Rahmeh, W.; Frush, K.; Grewal, D. K.; Bahsoun, M.; Leonard, M.; Frankel, A.; Classen, D. C.; Pestotnik, S. L.

2026-06-10 health informatics 10.64898/2026.06.05.26354271 medRxiv
Top 0.1%
8.4%

Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting. Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios non-overlappingly: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026. Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper. Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. 
The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
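The scenario counts in this abstract are internally consistent, which a few lines of arithmetic confirm. The sketch below only restates numbers quoted above; the variable names are illustrative and not taken from PsiBench.

```python
# Tier composition as quoted in the abstract
tiers = {
    "discrimination": {"fatal": 50, "deception": 48},
    "operational": {"serious_unsafe": 261, "safe": 133},
}

tier_totals = {name: sum(parts.values()) for name, parts in tiers.items()}
n_scenarios = sum(tier_totals.values())

n_models, n_runs = 40, 3
n_evaluations = n_models * n_scenarios * n_runs

# Share of fatal scenarios in the Discrimination tier ("near-balanced 51%/49%")
fatal_share = tiers["discrimination"]["fatal"] / tier_totals["discrimination"]

# Observation only: fatal + serious-unsafe equals the 311 alert-required
# scenarios quoted for the Attribution tier. The abstract does not state
# that this is how the Attribution tier is defined.
alert_required = tiers["discrimination"]["fatal"] + tiers["operational"]["serious_unsafe"]

print(tier_totals, n_scenarios, n_evaluations, f"{fatal_share:.0%}", alert_required)
```

The totals match the stated 98 and 394 scenarios per tier, 492 scenarios overall, and 59,040 evaluations.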

10
Phenome-Wide Association Study of Pre-Cancer Diagnosis Electronic Health Records Identifies Risk and Inverse Associations in the All of Us Research Program

Rich, C. C. D.; Bang, E. J.; Bair, A. B.; Richardson, B. E.; Millington, J. L.; Bates, B. A.; Davis, M. F.; Bailey, M. H.

2026-05-28 health informatics 10.64898/2026.05.26.26353823 medRxiv
Top 0.1%
6.9%

Background: The All of Us Research Program represents a rich resource for cancer epidemiology research, with over 400,000 participants with whole genome sequences linked to electronic health records (EHR). Large cancer datasets often focus exclusively on cases without controls and neglect pre-diagnosis healthcare occurrences. Here, we perform a phenome-wide association study (PheWAS) of EHR data at least 1 year pre-diagnosis between cancer cases and matched controls, revealing co-occurring and mutually exclusive phenotypes. Methods: We identified 55,000+ cancer cases across 21 cancer types in All of Us version 8. To eliminate age-related confounding, we implemented a two-stage matching and censoring strategy: loose matching on demographics to establish index dates and cohort comparability, followed by right-censoring of EHR data (excluding 1 year pre-diagnosis/index), then 1:2 matching to address residual demographic imbalance. We tested associations between 23,193 cancer cases, 46,386 matched controls and approximately 1,600 clinical phenotypes using logistic regression adjusted for sex at birth, self-reported race, age at diagnosis/index date, and two censored EHR metrics: observation window and unique condition count, with Bonferroni correction for multiple testing. Results: Our analysis identified 232 significantly associated phenotypes, confirming established cancer risk factors including elevated prostate specific antigen (OR = 2.92, 95% CI: 2.65-3.23; p-value = 1.8 × 10⁻¹⁰¹) and multinodular goiter (OR = 1.73, 95% CI: 1.56-1.91; p-value = 6.7 × 10⁻²⁷). Further investigation into the relationship between several phenotypes with seeming inverse effects is warranted. Conclusions: This PheWAS of EHR data at least 1 year pre-diagnosis leveraged the diversity of All of Us to examine how clinical phenotypes prior to cancer diagnosis vary across cancer types and racial groups. 
Our findings validate All of Us as a robust platform for cancer epidemiology research, confirming established risk factors at scale across diverse populations. This work provides methodological insights for EHR-based susceptibility analyses and demonstrates the value of agnostic phenome-wide approaches for generating hypotheses in precision medicine.

11
Extraction of Human Phenotype Ontology (HPO) Concepts from Clinical Notes Utilizing Large Language Models (LLM) with Model Context Protocol (MCP)

Larsen, M. E.; Campbell, I. M.; Orlando, L. A.; Robinson, P.; Walton, N. A.

2026-05-25 health informatics 10.64898/2026.05.23.26353963 medRxiv
Top 0.2%
6.7%

Background: Accurate extraction of Human Phenotype Ontology (HPO) terms from clinical notes is essential for variant prioritization and genetic diagnosis. Large language models (LLMs) often struggle to balance precision, hallucination avoidance, and ontology mapping accuracy, and prior work has shown that retrieval-based grounding can improve performance for individual models. We hypothesized that real-time ontology grounding through external tools would improve these metrics across heterogeneous LLMs, and we evaluated the Model Context Protocol (MCP), a standardized open framework for integrating external tools, as a vendor-agnostic mechanism for delivering such grounding. Methods: Five LLMs (Claude Sonnet 4.5, GPT-5.1, Gemini 2.5 Pro, Grok 4.1, and Qwen3 30B) extracted HPO terms from four synthetic clinical genetics notes under two conditions: baseline ("No Tools," internal knowledge only) and tool-augmented ("With Tools"), with real-time HPO retrieval delivered through MCP for models with native support and through functionally equivalent native tool-calling interfaces otherwise. Each model performed ≥50 runs per note per condition (>2,000 total runs). Performance was evaluated using Precision, Recall, and F1-score. Outputs were manually adjudicated to classify mapping errors and hallucinations. Results were benchmarked against a commercial EHR-based HPO extraction tool. Results: Tool augmentation significantly improved performance across all models. Mean aggregate F1-score increased from 0.46 (SD 0.22) in the baseline condition to 0.72 (SD 0.15) with tools (p < 0.001). Mapping Error Rate decreased from 40.9% to 7.8% (p < 0.001), and Precision increased from 56% to 90%. Performance gains were observed across all model families, including the open-weight Qwen3 model (F1 0.11→0.50). For inferred phenotypes, F1 improved from 0.20 to 0.34 (p < 0.001) without a significant increase in hallucination rate (p = 0.08). 
Compared with the commercial benchmark, tool-augmented LLMs achieved higher F1-scores and substantially greater recall for inferred phenotypes. Conclusions: Real-time ontology grounding substantially improves HPO extraction across diverse LLMs by reducing mapping errors and enhancing phenotype inference. The Model Context Protocol provides a standardized, interoperable mechanism for delivering such grounding, supporting reproducible, vendor-agnostic deployment of clinical LLM pipelines in genomic medicine.

12
Design and Usability Evaluation of a Digital Guideline Management Application for a Pediatric Cardiac Center

Heidenreich, B. M.

2026-05-26 health informatics 10.64898/2026.05.24.26353982 medRxiv
Top 0.2%
6.7%

Background. Complex cases in specialized pediatric care require consistent adherence to evidence-based clinical pathways and protocols to ensure safe, high-quality, and equitable care. Currently, clinical pathways and supporting documentation are frequently distributed across multiple platforms, leading to fragmentation. Human-centered design principles can guide the development of healthcare technologies that minimize cognitive load and support rapid, efficient access to relevant information in clinical settings. The purpose of this study is to design and evaluate perceived usability of a pediatric cardiac center digital guideline management system that is embedded within the electronic health record leveraging human-centered design. Methods. This study used a mixed-methods usability evaluation to assess a digital guideline management system prototype embedded into clinical workflow. Through human-centered design principles, the prototype provides a centralized digital document library that organizes cardiac-specific clinical pathways, guidelines, procedures, and related resources. A small but diverse sample, encompassing a wide variety of roles and clinical areas within the pediatric cardiac center, was recruited to evaluate the perceived usability of the prototype. Usability was evaluated by stakeholders using the validated System Usability Scale (SUS) with additional optional questions to understand perceptions of the information architecture and clinical value. Results. Preliminary usability testing showed a mean SUS composite score of 76.5, indicating above average usability. Questions related to the complexity of the system and user confidence received high scores across participants. Lower scores were observed for questions related to usage frequency and ability to learn the system very quickly. Conclusion. 
Leveraging human-centered design when building a digital guideline management system embedded within clinical workflow revealed positive perception from participants. By centralizing access to clinical resources, this prototype can reduce current-state fragmentation. Further evaluation of larger samples is needed to develop a list of future recommendations.
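For readers unfamiliar with how a SUS composite such as the 76.5 above is derived, the standard scoring rule is short enough to sketch. The function name `sus_score` and the example responses are hypothetical and not data from this study; only the scoring rule itself (odd items contribute the response minus 1, even items contribute 5 minus the response, and the sum is multiplied by 2.5) is the standard one.

```python
def sus_score(responses):
    """Score one participant's System Usability Scale questionnaire.

    responses: ten integers from 1 (strongly disagree) to 5 (strongly agree),
    in questionnaire order. Returns a value from 0 to 100.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded
            total += response - 1
        else:
            # Even-numbered items are negatively worded
            total += 5 - response
    return total * 2.5


# Hypothetical participant, for illustration only
example = [4, 2, 4, 2, 4, 2, 4, 2, 4, 2]
print(sus_score(example))
```

A study-level composite is the mean of the per-participant scores. A commonly cited benchmark places the average SUS score near 68, which is the usual basis for calling a score in the mid-70s above average.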

13
Relationship Extraction for Adverse Drug Events in Clinical Notes Using Large Language Models

Plasek, J. M.; Li, Y.; Amato, M. G.; Foer, D.; Seger, D. L.; Alzaidi, S.; Zhou, H.; Jackson, G. P.; Bates, D. W.; Zhou, L.

2026-06-01 health informatics 10.64898/2026.05.28.26354362 medRxiv
Top 0.2%
6.5%

Background: Adverse drug events (ADEs) are a critical indicator of patient safety but are often documented only in free-text clinical notes. The potential of recent advances in natural language processing (NLP), particularly generative large language models (LLMs), to identify ADEs remains understudied. This study aimed to compare the performance of multiple LLMs in identifying ADE-Drug relationships in inpatient and ambulatory clinical notes. Methods: We used clinical notes from the 2018 National NLP Clinical Challenge (n2c2) ADE dataset (inpatient; n=505) and from outpatient encounters (n=2,555) between October 1, 2018, and December 31, 2019, at a large academic medical center based in New England. Notes were pre-processed into snippets for model input. Evaluated models included GPT-4o, GPT-4o-mini, LLAMA 3.3-70B, and their instruction fine-tuned variants (including low-rank adapters for LLAMA). Performance was assessed using both strict and relaxed evaluations (precision, recall, and F1) for all models, followed by manual evaluation (exact semantic match, partial match, missing ADE, drug mention only, not a drug, or wrong) of the two best-performing models. Results: GPT-4o and GPT-4o-mini were the top-performing models among those evaluated. GPT-4o consistently outperformed GPT-4o-mini in ADE extraction across both datasets, with higher F1-scores (0.524 vs. 0.381) and a more balanced precision-recall profile. Both models captured ADEs effectively in explicit and complex clinical contexts, although limitations included misclassification of pre-existing allergies and occasional conflation of therapeutic indications with adverse effects. GPT-4o achieved higher exact match coverage and fewer errors across clinical notes, indicating more reliable performance in both inpatient and ambulatory settings. 
Conclusion: This work establishes a foundation for integrating LLM methods into real-world drug safety surveillance, with direct implications for improving patient safety.

14
Augmenting Structured Diagnoses through Effective Use of Pre-trained Large Language Models on Clinical Notes

Razzaghi, H.; Nguyen, N.; Pargi, M.; Wieand, K.; Bunnell, T.; Bailey, C.

2026-06-02 health informatics 10.64898/2026.05.30.26354533 medRxiv
Top 0.2%
6.4%

Objective: Clinical narrative provides a unique window into provider reasoning and attribution, but use has been limited by resource requirements and extensive fine-tuning, and LLMs in particular have traditionally not performed well at medical coding. We optimize and evaluate a reproducible method for automated diagnosis assignment using LLMs in clinical notes and compare with EHR structured diagnoses. Methods: We used GPT-OSS for prompt engineering and task segmentation to create a model that extracts ICD-10-CM diagnoses, with estimates of severity, currency, and importance, from progress notes. We assessed performance across multiple cohorts of patients aged 0-21 years. For each, 100 outpatient provider notes were selected across levels of severity, along with coded diagnoses from that visit (EHR); a subset of 130 notes was subjected to clinical expert review. Results: Comparison showed 18.7% exact code and 33.3% ICD-10-CM category match between EHR and LLM, but semantic similarity of 0.93 at the category level. Compared to expert review, LLM precision was 0.84 and recall 0.49 for exact matches, and 0.92 and 0.62, respectively, for category-level matching. In contrast, EHR coded diagnoses showed slightly higher precision (0.94 for both cases) and substantially lower recall (0.27 and 0.43) versus expert review. Codes not identified by the LLM were more often rated by the reviewer as lower importance or certainty. Conclusion: We demonstrate a reusable approach to optimizing a pretrained LLM for use in diagnosis extraction from clinical notes, facilitating large-scale diagnosis screening by LLMs without the need for expensive study-specific model refinement.

15
A Comparison of Manual and Automated Approaches to Developing Computable Algorithms for Identifying Acute Pancreatitis

Bann, M. A.; Carrell, D. S.; Gruber, S.; Heagerty, P. J.; Williamson, B. D.; Nelson, J. C.; Hazlehurst, B.; Felcher, A.; Nyongesa, D. B.; Slaughter, M. T.; Sapp, D. S.; Cronkite, D. J.; Ball, R.; Floyd, J. S.

2026-06-08 health informatics 10.64898/2026.06.05.26354934 medRxiv
Top 0.2%
6.4%

Objective: Clinical phenotyping methods that rely on clinical and informatics expertise can be time-intensive and costly. We tested both manual and highly automated approaches using electronic health record (EHR) data to identify an FDA Sentinel Initiative health outcome of interest, acute pancreatitis. Materials and Methods: We trained and evaluated machine learning algorithms using EHR data with two approaches: a custom approach that included manually curated features and trained on outcomes data validated with medical record review, and a highly automated approach that greatly simplifies and automates feature engineering and relies on low-cost silver-standard outcomes for model training. Results: Custom algorithms using manually curated structured claims data discriminated cases from non-cases with a high degree of accuracy (cv-AUC 0.89 [95% CI 0.84-0.94]); the inclusion of natural language processing (NLP)-derived covariates from clinical notes increased performance slightly (cv-AUC 0.91 [95% CI 0.86-0.97]). The automated algorithm trained on the outcome count of diagnosis codes performed less well (AUC 0.80 [95% CI 0.75-0.85]) but improved using maximum lipase value as an outcome (AUC 0.88 [95% CI 0.84-0.92]). At a positive predictive value of 90%, the custom algorithm had a sensitivity of 92%, the automated algorithm trained on diagnosis code count had a sensitivity of 45%, and the automated algorithm trained on maximum lipase value had a sensitivity of 84%. However, a prediction rule derived by clinicians during chart review was nearly as accurate (maximum lipase value ≥ 3 times upper limit of normal; AUC 0.86, PPV 85%, sensitivity 92%). Discussion: Machine learning algorithms with manually curated structured data and NLP features trained on validated outcomes data successfully identified validated events.
Use of an outcome in the automated model based on specific phenotype knowledge (maximum lipase value) allowed for performance similar to that of the custom model with considerably fewer resources.

16
PatientEvent: An Event-Based Ontology for Patient-Initiated Portal Communication

Gatto, J.; Yang, J.; Seegmiller, P.; Rahat, R.; Burdick, T.; Preum, S. M.

2026-06-03 health informatics 10.64898/2026.06.01.26354623 medRxiv
Top 0.2%
6.4%

Patient portal messaging has become a primary channel for asynchronous clinical communication, spanning a wide range of content, from symptom reports and medication concerns to administrative requests. Despite this volume and diversity, there is no formal representation for what a portal message contains: no vocabulary for the clinical and administrative events it describes, or for the attributes of those events that the patient has actually disclosed. Without such a representation, it is difficult to systematically analyze portal communication, assess message completeness, or build downstream tools that depend on structured input, such as automated triage, response drafting, and follow-up question generation. A clinical event schema, grounded in real portal messages and reviewed by clinicians, would provide this missing foundation. We introduce a clinical event ontology for patient portal messages, containing 8 event types and 70 roles that span clinical content (symptoms, medications, diagnostic tests, treatment responses, patient history) and administrative content (medical needs, logistics, social factors). The ontology was developed iteratively with clinical expert input and human evaluation. As a downstream application, we use the ontology to characterize the event types and roles most frequently sought in clinician follow-up questions, which provides insight into what clinicians ask about when reading portal messages.

17
A Retrospective Evaluation of the Microsoft Healthcare Agent Orchestrator for Tumor Board Patient Summaries

Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.

2026-06-01 health informatics 10.64898/2026.05.22.26353812 medRxiv
Top 0.2%
6.3%

Background: Preparing tumor board patient summaries is time intensive. Large-language-model based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. 
These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.

18
A hierarchical clinical fusion transformer model for personalized opioid treatment: Development and validation in diabetic surgical patients

Naderalvojoud, B.; Sutjiadi, B. J.; Koul, A.; Curtin, C.; Gevaert, O.; Hernandez-Boussard, T.

2026-06-08 health informatics 10.64898/2026.06.04.26353331 medRxiv
Top 0.2%
6.2%

Background: Machine learning (ML) models are increasingly used to predict adverse outcomes after surgery. However, most rely on static patient characteristics (e.g., age, comorbidities) and overlook clinician-controlled treatment decisions that can be actively modified at the point of care. Discharge opioid prescribing is a key modifiable, clinician-controlled decision, yet optimizing prescribing choices across multiple adverse outcomes remains underexplored in predictive modeling. This study addresses that gap by introducing a novel ML framework that explicitly separates fixed patient risk factors from modifiable prescribing options to support personalized, risk-informed opioid prescribing decisions. Methods: We developed the Hierarchical Clinical Fusion Transformer (HCF-Transformer), an ML model designed to estimate patient-specific risks across four postoperative outcomes: prolonged opioid use (POU), chronic pain (CP), 30-day readmission, and opioid-associated outcomes (OAO). The model constructs patient risk profiles from fixed, non-modifiable baseline factors, followed by a transformer layer. Clinician-controllable discharge opioid regimens are modeled as alternative intervention candidates and fused with the fixed risk representation through a clinical fusion mechanism, enabling assessment and ranking based on predicted risks. A Total Relative Risk (TRR) metric, calibrated to each outcome prediction threshold, guides the recommendation process. We evaluated the model in diabetic surgical patients, a common high-risk population. Results: The study included 157,853 unique diabetic surgical patients, with outcome prevalences ranging from 47.2% (POU) to 1.8% (OAO). The HCF-Transformer achieved the highest AUROCs, 0.798 for POU, 0.712 for 30-day readmission, 0.808 for CP, and 0.922 for OAO, outperforming Random Forest, FT-Transformer, and ResNet-based models.
Compared to these baselines, HCF-Transformer generated more stable and discriminative risk estimates and demonstrated significant variation in TRR scores across discharge opioid options (ANOVA p < .01, eta-squared > .01). This enabled consistent identification of lower-risk regimens tailored to patient-specific profiles. Conclusions: The HCF-Transformer introduces a novel hierarchical fusion approach to optimize opioid prescribing by integrating static patient risk profiles with modifiable discharge options. Using transformer-based modeling and a quantifiable TRR metric, the model delivers personalized, risk-aware recommendations. This approach enables data-driven opioid prescribing tailored to individual risk and has the potential to improve postoperative outcomes in high-risk populations. Our findings demonstrate that integrating modifiable factors with structured risk profiles through a transformer-based fusion architecture can enhance decision-support systems, paving the way for more actionable and personalized AI in healthcare.

19
Quality and Safety profiles of AI-Generated vs Clinician-Generated Handoffs in Hospital Medicine

Shah, K. P.; Airan Javia, S.; Savage, T.; Bressman, E.

2026-06-08 health informatics 10.64898/2026.06.05.26354946 medRxiv
Top 0.2%
6.2%

End-of-rotation handoffs are critical for patient safety but add to documentation burden for hospitalists. Generative artificial intelligence (AI) may help automate handoff creation using electronic health record data, but its impact on quality and safety is unclear. Methods: We developed an AI handoff tool with a large language model using clinical notes as input and conducted a retrospective evaluation comparing AI-generated and clinician-authored handoffs. Handoffs were assessed across domains of quality and safety through a structured review. Results: Quality ratings were similar between AI and human handoffs (3.7 vs. 3.5, p=0.57). AI-generated handoffs were rated higher for organization (4.4 vs. 4.1, p=0.05) and completeness (4.1 vs. 3.6, p=0.01), but lower for conciseness (3.7 vs. 4.1, p=0.03) and accuracy (4.1 vs. 4.4, p=0.03). Error rates were comparable (0.3/handoff in both groups); however, AI-generated handoffs included inaccuracies (9% of AI errors) and hallucinations (1% of AI errors), while clinician-authored handoffs contained only omissions. Conclusion: Human and AI handoffs have differing error profiles and tradeoffs between completeness and conciseness. Prospective evaluation in clinical workflows is underway.

20
Study Design Indexing in Transition: A Focused Comparison of manual NLM Indexing vs. Transformer-based Automated Models

Das, P.; Schneider, J.; Mayo-Wilson, E.; Kilicoglu, H.; Menke, J. D.; Nam, D.; Ninan, K.; Oberste, J.-P.; Troy, A. M.; Ying, X.; Holt, A. W.; Smalheiser, N. R.

2026-06-04 health informatics 10.64898/2026.06.03.26354854 medRxiv
Top 0.2%
4.9%

Objectives: Study design indexing of biomedical publications is crucial for evidence retrieval and synthesis. We sought to evaluate the accuracy and suitability of a transformer-based model (TM) for indexing clinical study designs, in comparison to National Library of Medicine (NLM) indexing. However, this is challenging for at least three reasons: First, to date, all automated systems have been trained and evaluated on manual NLM indexing assignments, which are themselves subject to errors. Second, TM's probabilistic predictive scores take into account uncertainty, and can be converted to TRUE/FALSE assignments in different ways depending on the needs of users, while NLM labels are categorical. Third, our goal (to tag only articles that exhibit a given design) differs from that of NLM, which tags articles that discuss as well as exhibit that design. Materials and Methods: Therefore, we carried out a limited evaluation of the TM model that focuses only on the articles that received the most confident predictions, that is, the highest scores that are almost certainly TRUE and the lowest scores that are almost certainly FALSE, but which disagreed with NLM assignments. This was performed both for articles published in 2016 (when NLM decisions were manual) and in 2025 (when NLM decisions were automated). To establish ground truth, dual annotators indexed the articles independently, following written definitions, for four prominent study designs--cohort, case-control, cross-sectional, and case report. Results: For three designs (case-control, case report, cross-sectional), the articles having the top 100 predictive TM scores (when NLM failed to assign that design) were judged to exhibit that design in the great majority (86-100%) of cases. Conversely, the articles having the lowest 100 predictive TM scores (when NLM did assign the study design) exhibited the design in relatively few (0-21%) cases.
The most confident predictions of the TM model were highly accurate and not redundant with automated NLM indexing; the exception was cohort study articles, for which both TM and NLM labels showed high error rates of both omission and commission. Discussion and Conclusion: TM may have value for identifying articles exhibiting study designs, which is especially important for clinical decision-making as well as systematic reviews and other evidence syntheses. NLM indexing of cohort studies cannot be regarded as a reliable gold standard for training or evaluation of automated systems, warranting efforts to create a new manually annotated corpus.